Contact:
Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060
Last update: 5 April 2012
Task 2. Case studies on tabular data
Objectives
The issue of harmonisation of SDC methods for tabular data in the ESSnet has to be addressed at two main stages. At stage 1, the "data owners" (usually the NSI of a member state) use SDC methods to protect tabular data which have to be delivered to Eurostat under a European regulation. At stage 2, after computing the European aggregates, Eurostat applies SDC methods to the European-level data to ensure protection of the data identified as "confidential" at stage 1.
Traditionally, cell suppression has been the main methodology at both stages, especially for business data. Recently, rounding/perturbation methods have been proposed to replace cell suppression at stage 2, and have been explored and tested on behalf of Eurostat.
This task aims to describe a methodological framework for the harmonisation of SDC methods for economic tabular data based on cell suppression, to demonstrate alternative approaches based on perturbation, and to give recommendations for the future harmonised development of these methods.
With respect to the upcoming 2011 Census, we aim at a first evaluation of individual country checks on confidentiality.
Description of work
Task 2-1: Case studies cell suppression
While most NSIs in the ESS use cell suppression methodology for SDC of business statistics tables, implementations of this method differ widely between data sets and between countries. This is not merely an issue of using different parameters or different software: a coherent cell suppression approach requires that the whole set of tables to be released from a given dataset be analysed. Relations between table cells have to be modelled, assumptions have to be made on the a priori knowledge of possible intruders, priorities in the publication have to be identified, sampling and data quality aspects have to be taken into account, etc. This kind of analysis is an inherent step of any cell suppression procedure. Depending on the dataset and publication, its complexity can vary from rather trivial considerations to a quite challenging analysis. The outcome of the analysis in this task is the definition of a set of linked tables to be handled by the cell suppression procedure. The same holds for the European dimension, i.e. at stage 2. Hence, a discussion on harmonisation must not focus only on software and parameters, but has to take this modelling process into account. Capobianchi and Franconi (2009) and Virgili and Franconi (2009) provide case studies describing this kind of process for two key European statistics, the SBS and FATS, from the member state perspective, i.e. at stage 1, for Italy. De Wolf and Hundepool (2010) and Schmidt and Giessing (2010) explain how to process sets of linked tables efficiently with recent software tools.
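To illustrate why relations between table cells have to be modelled, the following minimal sketch shows how a single suppressed cell can be recomputed from a published total, and why an additional (secondary) suppression is then needed. The function name and all figures are hypothetical and serve only as an illustration; this is not the procedure implemented in τ-ARGUS or any other tool mentioned here.

```python
def recoverable_cells(row, total):
    """Return {index: value} for suppressed cells (None) that can be
    recomputed exactly from the published cells and the published total."""
    hidden = [i for i, value in enumerate(row) if value is None]
    if len(hidden) != 1:
        # Zero hidden cells: nothing to recover.
        # Two or more: subtraction only reveals their sum, not each value.
        return {}
    published = sum(value for value in row if value is not None)
    return {hidden[0]: total - published}


# Hypothetical table row with published total 250.
row_primary_only = [120, None, 75, 40]       # one sensitive cell suppressed
row_with_secondary = [120, None, 75, None]   # plus one secondary suppression

print(recoverable_cells(row_primary_only, 250))    # {1: 15}
print(recoverable_cells(row_with_secondary, 250))  # {}
```

With two suppressed cells only their sum remains derivable, so an intruder can still bound each value. In linked tables the same cell may appear in several such relations, which is why the full set of tables has to be analysed together.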
Task 2-1a will follow the approaches of Capobianchi and Franconi (2009) and Virgili and Franconi (2009), describing and comparing the analytical process and its outcome for several project partner countries (Germany, Netherlands) for relevant parts of the SBS (e.g. manufacturing industry) and FATS publications. A description of the current implementation for SBS is also envisaged, in order to share experiences on the production side.
Task 2-1b will use the software implementations in τ-ARGUS (De Wolf and Hundepool, 2010) and the wrapper function from SAS to τ-ARGUS developed by Schmidt and Giessing (2010) to process the data sets resulting from task 2-1a. Destatis will also explain how to use the implementation of Schmidt and Giessing (2010) at stage 2 and compare this approach to that of the Eurostat software CIF.
Task 2-1c should draw conclusions from the results of tasks 2-1a and 2-1b, coming up with ideas and recommendations for essential building blocks of a methodological framework for the harmonisation of tabular SDC methods based on cell suppression.
Task 2-2: Case studies on perturbative methods
Even with perfectly harmonised cell suppression methodology, cell suppression at stage 2 usually leads to comparatively large losses of information. Giessing, Hundepool and Castro (2007, 2010) and Hundepool, Giessing and Castro (2008a, 2008b) suggest and discuss an alternative methodology based on rounding/perturbation using Controlled Tabular Adjustment (Castro, 2006). Results of these experiments were mostly promising. On the other hand, the analytical process described above remains as complex as with cell suppression, and so does the effort required to implement suitable harmonised procedures. It may eventually be possible to reduce this complexity by bringing stochastic perturbation methods into the context of business statistics, using the idea of consistent record keys initially proposed by the Australian Bureau of Statistics for population Census data (Fraser and Wooton, 2006). In principle, this kind of methodology could be of interest at stage 1 as well. Another alternative would be a pre-tabular perturbation method as suggested in Evans et al. (1998), which is naturally better suited to implementation at stage 1.
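The idea of consistent record keys can be sketched as follows. Each record receives a fixed random key, the key of a table cell is derived from the keys of the records contributing to it, and the perturbation applied to the cell depends only on that cell key. The same group of records therefore receives the same perturbation in every table in which it appears. The sketch below is a strong simplification for frequency counts: the key space, the seed and the noise distribution are illustrative assumptions, not the parameters of the Australian Bureau of Statistics method or of any method to be developed in this task.

```python
import random

KEY_SPACE = 2**32  # assumed size of the key space (illustrative)


def assign_record_keys(record_ids, seed=2012):
    """Give every record a fixed pseudo-random integer key."""
    rng = random.Random(seed)
    return {rid: rng.randrange(KEY_SPACE) for rid in record_ids}


def cell_key(member_ids, record_keys):
    """Combine the keys of all contributing records; the result does not
    depend on the order of the records."""
    return sum(record_keys[rid] for rid in member_ids) % KEY_SPACE


def perturbed_count(member_ids, record_keys):
    """Perturb a frequency count by -1, 0 or +1, driven by the cell key."""
    count = len(member_ids)
    if count == 0:
        return 0  # empty cells stay empty in this sketch
    u = cell_key(member_ids, record_keys) / KEY_SPACE  # value in [0, 1)
    if u < 0.25:
        noise = -1
    elif u < 0.75:
        noise = 0
    else:
        noise = 1
    return max(count + noise, 0)


keys = assign_record_keys(range(100))
cell_in_table_1 = [3, 17, 42, 58]   # a cell as it appears in one table
cell_in_table_2 = [58, 3, 42, 17]   # the same records in another table
assert perturbed_count(cell_in_table_1, keys) == perturbed_count(cell_in_table_2, keys)
```

Integer keys are used here so that the cell key is independent of the order in which records are summed. Applying the idea to magnitude tables in business statistics would require a different noise mechanism, which is exactly what the task is meant to explore.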
Task 2-2a will introduce the new stochastic perturbation methods and provide a demonstration for both stages (1 and 2), using the data sets of task 2-1, and will compare the results to cell suppression and the rounding/perturbation method, respectively.
Task 2-2b will evaluate the results of task 2-2a and give a recommendation for a framework for the development of perturbation-based SDC methods.
Task 2-3: Case studies on census tables
In 2011 all EU member states have to conduct a Census again. This is for most NSIs a major operation that involves a lot of work and high costs. All countries have to validate and protect the Census output in the form of so-called hypercubes (high-dimensional tables). Even the formats used for the data will differ from country to country. However, in the end all data have to be transformed to SDMX format and offered to Eurostat. Eurostat will send the DSDs (Data Structure Definitions) for the delivery in October of 2010. Finally, Eurostat will check the quality of all validated and confidentialised Census hypercubes with a programme that still has to be built. This programme will not be ready on time for this ESSnet on common tools and harmonised methodology for SDC in the ESS, but individual country checks on confidentiality could be evaluated within the context of this new ESSnet. Although the actual delivery deadline of all hypercubes to Eurostat is only in 2014, in 2011 the first real tables will become available. All countries will in addition have lots of preliminary and test Census tables. These will be used to test the new developments in tabular data protection.
It would be very profitable if European countries could learn from each other's Census confidentiality approaches. At present we face the risk that many countries stay on the safe side and protect too much information. This could hamper the calculation of European totals. Moreover, if each country suppresses a different subtotal, no subtotals can be calculated at the European level. By including this activity in the current ESSnet, an enormous step forward will be made in harmonising the protected output between European countries. In this way more, and better comparable, Census output can be produced with minimum information loss. In addition, it will lead to more publishable European totals than in the current situation.
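The effect of uncoordinated suppression on European subtotals can be illustrated with a small sketch. Country names, categories and figures are hypothetical, and in practice the choice of cells to suppress depends on the data; the sketch only assumes that a European subtotal can be computed when every country delivers the corresponding value.

```python
def european_subtotals(country_tables):
    """Sum each category over all countries; the result is None as soon as
    one country has suppressed (None) its value for that category."""
    categories = next(iter(country_tables.values())).keys()
    result = {}
    for category in categories:
        values = [table[category] for table in country_tables.values()]
        result[category] = None if any(v is None for v in values) else sum(values)
    return result


# Each country suppresses a different subtotal.
uncoordinated = {
    "Country A": {"x": None, "y": 40, "z": 25},
    "Country B": {"x": 30, "y": None, "z": 50},
    "Country C": {"x": 10, "y": 20, "z": None},
}
# All countries agree to suppress the same subtotal.
coordinated = {
    "Country A": {"x": None, "y": 40, "z": 25},
    "Country B": {"x": None, "y": 35, "z": 50},
    "Country C": {"x": None, "y": 20, "z": 15},
}

print(european_subtotals(uncoordinated))  # {'x': None, 'y': None, 'z': None}
print(european_subtotals(coordinated))    # {'x': None, 'y': 95, 'z': 90}
```

In both cases each country suppresses exactly one subtotal, but only the coordinated pattern leaves European subtotals that can be calculated.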
Task 2-3a will analyse a number of test Census hypercubes and protect these tables according to different rules and methods (in particular cell suppression and rounding) available in τ-ARGUS.
Task 2-3b will evaluate the results of task 2-3a and give recommendations on how to protect the harmonised EU Census tables of the member states with minimum information loss, using state-of-the-art SDC methods while respecting the different legal frameworks in Europe.
Task 2-3 Workshop on Statistical Disclosure Control of Census data